{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pandas常用数据类型及其操作\n",
    "## 一.Series的创建及其常见操作\n",
    "## 二.DataFrame的创建及其常见操作\n",
    "## 三.DatetimeIndex的创建及其常见操作\n",
    "## 四.Pandas读取和存储文件\n",
    "## 知识目标：理解Series和DataFrame两种不同的对象\n",
    "## 技能目标：掌握Series和DataFrame的创建方法及常见操作\n",
    "## 重点：DataFrame的创建方法及常见操作\n",
    "## 难点：Pandas读取和存储数据库文件\n",
    "\n",
    "=========================================================\n",
    "\n",
    "知识储备：\n",
    "#Pandas是Python中用于处理数据和分析数据的第三方库，提供高性能易用数据类型和分析工具。https://pandas.pydata.org/ (代码、文档、参考代码等)\n",
    "#引用：import pandas as pd（pd引入模块的别名）\n",
    "#Pandas基于NumPy实现的，常与NumPy、Matplotlib一同使用。\n",
    "#Pandas与NumPy的联系与区别：\n",
    "\n",
    "1.Pandas是NumPy的扩展库；\n",
    "\n",
    "2.NumPy关注数据的结构表达，体现在维度数据间的关系；\n",
    "\n",
    "3.Pandas关注数据的应用表达，体现在数据与索引之间的关系。\n",
    "#常见数据对象：\n",
    "\n",
    "1.Series（一维的数据结构）\n",
    "\n",
    "2.DataFrame（二维的数据结构）\n",
    "\n",
    "3.DatetimeIndex（时间序列）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例1:Pandas与NumPy的区别\n",
    "a = np.array([11,12,13,14,15])\n",
    "s = pd.Series([11,12,13,14,15])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 一.Series的创建及其常见操作\n",
    "Series是一个类似于一位数组的对象，由一组数据和与之相关的索引两部分构成。\n",
    "### 1.创建Series类对象\n",
    "#### （1）Python列表法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例1中s的创建方式就是Python列表法，索引参数index可以省略，省略后为自动索引值\n",
    "# 例2：用Python列表创建Series\n",
    "s1 = pd.Series([11,12,13,14,15],index=['a','b','c','d','e']) #指定索引值\n",
    "s1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （2）标量值法（相同的1个值）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例3：用标量值创建Series\n",
    "s2 = pd.Series(100,index=['a','b','c','d','e'])\n",
    "s2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （3）Python字典法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例4：用字典类型创建Series\n",
    "s3 = pd.Series({'语文':90, '数学':92, '英语':98, '物理':87, '化学':92})\n",
    "s3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （4）一维ndarray法或其他函数法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例5：用ndarray类型创建Series\n",
    "s4 = pd.Series(np.arange(1,5),index=np.arange(5,9))\n",
    "s4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例6：用其他函数创建Series\n",
    "s5 = pd.Series(range(5))\n",
    "s5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.Series对象的基本操作\n",
    "#### （1）常用属性\n",
    "通过：values，index，dtypes\n",
    "\n",
    "获取：数据， 索引， 类型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例7：获取Series的数据和索引\n",
    "s1 = pd.Series([11,12,13,14,15],index=['a','b','c','d','e'])\n",
    "s1.index\n",
    "s1.values\n",
    "s1.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （2）查改数据\n",
    "通过自动索引或自定义索引查找数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例8：查找某个数据\n",
    "s1['c']     #自定义索引\n",
    "s1[2]       #自动索引 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例9：查找某些数据\n",
    "s1[['c','d']]\n",
    "s1[[2,3]]\n",
    "s1[[1,'c','d']]   #自定义索引和自动索引并存，但不能混用\n",
    "s1[2:4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例9：修改数据\n",
    "s1['a']=0\n",
    "s1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### （3）常见运算\n",
    "两个Series对象之间可以进行四则运算和幂运算,但是只对两个Series对象中都有的索引对应的值进行计算，非共同索引对应的值则为空值NaN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例10：两个Series对象相加\n",
    "s6 = pd.Series([1,2,3],['c','d','e'])\n",
    "s7 = pd.Series([9,8,7,6],['a','b','c','d'])\n",
    "s6+s7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例11：在s3的每行索引后面加上_张三\n",
    "s3.add_suffix('_张三')  #如果在索引前面加则用prefix()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例12：查看s3中最大值的索引\n",
    "s3.idxmax()     #注意idxmax与argmax函数的区别，前者适用于Series和DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例13：查看s3中90分以上的数据\n",
    "s3[s3>90]\n",
    "# 例14：查看s3中大于中值的数据\n",
    "s3[s3>s3.median()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例14：s3的值乘以60%，保留1位小数\n",
    "round(s3*0.6,1)   #Round函数是按照指定的小数位进行四舍五入运算"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 二.DataFrame的创建及其常见操作\n",
    "#DataFrame是一个类似于二维数组或表格的对象，它每列的数据可以是不同的数据类型，每列共用相同索引。\n",
    "#DataFrame不仅有行索引（index），还有列索引（column）。\n",
    " \n",
    "    0     1     2\n",
    "    \n",
    "0\tdata_1  data_a\tdata_w\n",
    "\n",
    "1\tdata_2\tdata_b\tdata_x\n",
    "\n",
    "2\tdata_3\tdata_c\tdata_y\n",
    "\n",
    "3\tdata_4\tdata_d\tdata_z\n",
    "\n",
    "### 1.创建DataFrame类对象\n",
    "#### （1）二维ndarray对象法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例15：用二维ndarray对象创建DataFrame\n",
    "d1 = pd.DataFrame(np.arange(12).reshape(3,4))\n",
    "d1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （2）由一维ndarray、列表、字典、元组或Series构成的字典"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例16：用列表类型的字典创建DataFrame\n",
    "data = {'one':[1,2,3,4],'two':[9,8,7,6]}\n",
    "d2 = pd.DataFrame(data,index=['a','b','c','d'])\n",
    "d2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "思考1：如何将下列表格中的数据创建成DataFrame对象(行索引自定义为c(n))\n",
    "\n",
    "2019年7月部分城市新建住宅价格指数\n",
    "\n",
    "城市---环比---同比---定基\n",
    "\n",
    "北京---101.5--120.7--121.4\n",
    "\n",
    "上海---101.2--127.3--127.8\n",
    "\n",
    "广州---101.3--119.4--120.0\n",
    "\n",
    "深圳---102.0--140.9--145.5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = {'城市':['北京','上海','广州','深圳'],\n",
    "        '环比':[ 101.5, 101.2, 101.3, 102.0],\n",
    "        '同比':[ 120.7, 127.3, 119.4, 140.9],\n",
    "        '定基':[ 121.4, 127.8, 120.0, 145.5]}\n",
    "d3 = pd.DataFrame(data,index=['c1','c2','c3','c4'])\n",
    "d3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.DataFrame类型的基本操作\n",
    "#### （1）常用属性\n",
    "index、columns、values "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d3.index\n",
    "d3.colunms\n",
    "c3.values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （2）查询数据\n",
    "loc，iloc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例17：查询某列数据\n",
    "d3['同比'] 能否查询“同比”列数据?\n",
    "# 例18：查询某行数据\n",
    "d3['c2']   能否查询“上海”行数据?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d3[1:2]\n",
    "d3.loc['c2']\n",
    "d3.iloc[1,:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "loc，iloc的区别：\n",
    "#DataFrame.loc[行索引名称或条件，列索引名称]\n",
    "#DataFrame.iloc[行索引位置，列索引位置]\n",
    "#使用loc，如果内部传入的索引位置是一个区间，则前后均为闭区间；使用iloc，则为前闭后开区间。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例19：查询某个值\n",
    "d3.loc['c2','定基']\n",
    "d3.iloc[1,3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例20：如果行索引index为自动索引，查看loc，iloc的区别\n",
    "data = {'城市':['北京','上海','广州','深圳'],\n",
    "        '环比':[ 101.5, 101.2, 101.3, 102.0],\n",
    "        '同比':[ 120.7, 127.3, 119.4, 140.9],\n",
    "        '定基':[ 121.4, 127.8, 120.0, 145.5]}\n",
    "d4 = pd.DataFrame(data)\n",
    "d4.loc[1:3]\n",
    "d4.iloc[1:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "思考2：查询同比值大于120的城市，并只显示城市名和同比值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d3.loc[d3['同比']>120,['城市','同比']]     #用loc查询数据的优势是能够实现条件查询"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （3）更改数据\n",
    "原理同Series一样，对原始直接更改，随时更改即时生效，更改前务必确认或对原数据进行备份。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例21：为d3增加一列数据\n",
    "d3['评价指数']=100     #添加一列的方法只需要新建一个列索引，并对该索引下的数据进行赋值操作即可。\n",
    "d3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （4）删除数据\n",
    "#删除列：DataFrame.drop（labels，axis=1）\n",
    "#删除行：DataFrame.drop（labels，axis=0）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例22：删除某一列\n",
    "d3.drop(labels='定基',axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 三.DatetimeIndex的创建及其常见操作\n",
    "#Timestamp（时间戳）：最基础的时间类。表示某个时间点。在绝大多数的场景中的时间数据都是Timestamp形式的时间。\n",
    "#Period（固定时期）：表示单个时间跨度，或者某个时间段，例如某一天，某一小时等。\n",
    "#Timedelta：表示不同单位的时间，例如1天，1.5小时，3分钟，4秒等，而非具体的某个时间段。\n",
    "#DatetimeIndex：一组Timestamp构成的Index，可以用来作为Series或者DataFrame的索引。\n",
    "#PeriodtimeIndex：一组Period构成的Index，可以用来作为Series或者DataFrame的索引。\n",
    "#TimedeltaIndex：一组Timedelta构成的Index，可以用来作为Series或者DataFrame的索引。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.Timestamp、DatetimeIndex的创建\n",
    "#### （1）to_datetime()转换法\n",
    "to_datetime()函数主要用于将时间字符串转换为时间戳"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例23：将datetime转换为Timestamp对象\n",
    "t1 = pd.to_datetime('20201010')\n",
    "t1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例24：将datetime转换为DatetimeIndex对象\n",
    "t2 = pd.to_datetime(['20201010','20201011','20201012','20201013','20201014'])\n",
    "t2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （2）date_range()创建法\n",
    "date_range()函数主要用于生成一个具有固定频率的DatetimeIndex对象。\n",
    "\n",
    "date_range()语法date_range(start=None,end=None,periods=None,freq='D',tz=None,normalize=False,name=None,closed=None,**kwargs)\n",
    "#start:指定起始日期，默认为None\n",
    "#end：指定终止日期，默认为None\n",
    "#periods：指定要生成的时间戳数量\n",
    "#freq：指定时间间隔，默认为以自然日（1天）为单位，相邻两个时间间隔为1天，也可以指定计时单位，比如“5D”表示每隔5天\n",
    "#tz：表示时区\n",
    "#normalize:接收布尔值，默认为False。如果设为True，那么在产生时间戳索引值之前，会将start和end都转化为当日的午夜0点\n",
    "#closed：表示start和end区间端点是否包含在区间内，可以取值left（左闭右开），right（左开右闭），None（两端都是闭区间）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例25：创建DatetimeIndex对象，要求从2020年7月1日到今天，以一周为时间周期生成时间戳\n",
    "# Y表示年，且为年末最后一天，YS表示年初第一天\n",
    "# M表示月，且为月末最后一天，MS表示月初第一天\n",
    "# W表示周，默认W-SUN，表示按每周日计算一次\n",
    "# D表示日\n",
    "# H表示小时\n",
    "# T表示分钟\n",
    "t3 = pd.date_range(start='20200701',end='20201014',freq='W')\n",
    "t3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （3）datetime对象传参法\n",
    "时间序列不是孤立存在的，如何希望DataFrme对象有时间戳索引，可以将多个datetime对象的列表作为参数传给index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例26：为DataFrme对象创建时间戳索引\n",
    "from datetime import datetime\n",
    "data_list = {'one':[1,2,3,4,],'two':[9,8,7,6]}\n",
    "date_list = [datetime(2020,1,23),datetime(2020,3,30),datetime(2020,5,18),datetime(2020,7,10)]\n",
    "d5 = pd.DataFrame(data_list,index=date_list)\n",
    "d5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "思考3：模拟生成2020年10月10日各院系新生每小时报到人数（列举3个学院即可）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_demo = np.random.randint(30,100,(10,3))\n",
    "date_list = pd.date_range(start='202010100800',end='202010101700',freq='H')\n",
    "d6 = pd.DataFrame(data_demo,index=date_list,columns=['数据信息学院','艺术设计学院','医药护理学院'])\n",
    "d6"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.Timestamp、DatetimeIndex的常见操作\n",
    "#### （1）提取时间序列数据信息（常用属性）\n",
    "#year\t年\n",
    "#month\t月\n",
    "#day\t日\n",
    "#hour\t小时\n",
    "#minute\t分钟\n",
    "#second\t秒\n",
    "#date\t日期\n",
    "#time\t时间\n",
    "#week\t一年中第几周\n",
    "#quarter\t季节\n",
    "#weekofyear\t一年中第几周\n",
    "#dayofyear\t一年中第几天\n",
    "#dayofweek\t一周第几天\n",
    "#weekday\t一周第几天\n",
    "#weekday_name\t星期名称\n",
    "#is_leap_year\t是否闰年"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例27：提取d6中时间序列的相关属性\n",
    "d6_year = [i.year for i in date_list]\n",
    "d6_year"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "思考4：将订单信息表中的lock_time列的时间数据的相关信息（年份，月份，日期，星期名称等，取前5行数据即可）\n",
    "\n",
    "提示：读取csv文件使用pd.read_csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "思考5：查看订单信息表中最短用餐时间、最长用餐时间、平均用餐时间及整体订单的时间跨度；有无异常值？\n",
    "\n",
    "提示：lock_time为结算时间；      use_start_time为点餐时间"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "order_data = pd.read_csv('D:/pyData/meal_order_info.csv',encoding='gbk')\n",
    "order_data\n",
    "order_data.columns\n",
    "order_data['lock_time'] = pd.to_datetime(order_data['lock_time'])\n",
    "order_data['use_start_time'] = pd.to_datetime(order_data['use_start_time'])\n",
    "checktime = order_data['lock_time']-order_data['use_start_time']\n",
    "checktime.min()\n",
    "checktime.max()\n",
    "checktime.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### （2）重采样\n",
    "重采样是指将时间序列从一个频率转换到另一个频率的处理过程。如果是将高频率数据聚合到低频率，比如将每日采集的频率变成没有采集，则称为降采样；如果将低频率数据转换为高频率数据，则称为升采样。另一种既不属于降采样，也不属于升采样，比如，将采集数据的时间由周日改为周一。\n",
    "\n",
    "resample()函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 例28：修改d6数据采集频率，由1小时改为2小时。\n",
    "d6.resample('4H').sum()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
